LLM Evaluation#
One-Click Evaluation#
- huggingface/lighteval
- open-compass/opencompass
- https://rank.opencompass.org.cn/home
- open-compass/VLMEvalKit
- modelscope/evalscope
- agi-eval
Comprehensive Benchmarks#
- chatbot-arena Leaderboards covering text, t2i, web2dev, t2v, search, copilot, and more.
- https://lmarena.ai/
- chatbot-arena-leaderboard As of 2025-09-18, gemini-2.5-pro ranks first and qwen3-max-preview third.
- Reasoning & Knowledge
- Humanity's Last Exam https://agi.safe.ai/
- Visual reasoning https://mmmu-benchmark.github.io/
- Science https://github.com/idavidrein/gpqa
- Math https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions
- Code
- Code generation https://livecodebench.github.io/
- Code editing https://aider.chat/docs/leaderboards/
- Agentic coding https://www.swebench.com/
- [2026.01] Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces A benchmark for agents in command-line environments
- Factuality
- https://openai.com/index/introducing-simpleqa/ https://github.com/openai/simple-evals/
- Image understanding https://github.com/reka-ai/reka-vibe-eval
- Long context
- Multi-turn consistency https://arxiv.org/html/2409.12640v2
- Multilingual
- https://huggingface.co/datasets/CohereForAI/Global-MMLU
- Open LLM Leaderboard (Archived) Open LLM leaderboard, last updated 2024-10-17
- [2023.07] SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark Chinese LLM leaderboard
- CLUEbenchmark/SuperCLUE No longer updated
- https://www.superclueai.com/ Still being updated as of 2025-08
- SuperCLUE overall leaderboard [link]
- Text-to-Video Generation on MSR-VTT [link]
- Video Generation on UCF-101 [link]
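Chatbot Arena above ranks models from pairwise human votes. A minimal sketch of the classic Elo update behind such leaderboards (the current leaderboard fits a Bradley-Terry model instead; `k=32` and the helper below are illustrative, not Arena's actual code):

```python
def elo_update(r_a, r_b, winner, k=32):
    """Update two model ratings after one head-to-head battle.

    winner: "a", "b", or "tie". Returns the new (r_a, r_b).
    """
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # P(A beats B)
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A wins and gains k/2 = 16 points.
print(elo_update(1000, 1000, "a"))  # (1016.0, 984.0)
```

Ratings converge as votes accumulate; a Bradley-Terry fit over all battles removes the order-dependence of sequential Elo updates.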
Hallucination & Truthfulness#
- [2025.09] Why Language Models Hallucinate OpenAI: models hallucinate because current evaluations reward guessing over admitting uncertainty; fixing this requires reworking essentially all existing evaluation suites.
- [2025.05] HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection
- [2025.04] HalluLens: LLM Hallucination Benchmark Meta
- [2024.11] Measuring short-form factuality in large language models OpenAI's SimpleQA, 4,326 questions. Double-blind human annotation; both questions and answers are short. Diverse in topics and answer types; each question has a single indisputable answer. Human annotation error is about 3%. Incorrect answers are penalized, so blind guessing is discouraged.
- [2024.11] DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine Hallucination in the biomedical domain
- [2024.10] FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs
- [2023.12] FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation Meta. Splits long answers into atomic facts and verifies each one separately; the automated evaluator deviates from human judgment by less than 2%.
- [2023.10] Evaluating Hallucinations in Chinese Large Language Models Fudan University and Shanghai AI Laboratory.
- [2023.05] Do Language Models Know When They're Hallucinating References? Hallucination-detection methods: direct questioning, and asking the same question multiple times to check consistency.
- [2023.05] HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models 35k hallucination samples, generated with ChatGPT and then human-annotated to keep the best.
- RUCAIBox/HaluEval
- [2023] AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web
- AVeriTeC Dataset (2024) 4,568 samples with four labels: "Supported", "Refuted", "Not Enough Evidence", "Conflicting Evidence/Cherry-picking".
- [2022.07] Language Models (Mostly) Know What They Know Anthropic. Larger models are better calibrated, and few-shot prompting makes self-grading more accurate. Overall, large LMs do show some self-knowledge: they can judge whether their own answers are correct (P(True)) and predict whether they know the answer (P(IK)), and both abilities improve with scale and suitable prompting. Limitations: models can be misled by their own generated answers, calibration degrades under task shift, and they mostly mirror human knowledge rather than distinguishing ground truth from what people commonly say.
- [2022.05] Teaching Models to Express Their Uncertainty in Words OpenAI. "Verbalized probability": the model states its confidence in its answer directly in words or numbers, e.g. "90% confidence" or "high confidence".
- [2021] TruthfulQA: Measuring How Models Mimic Human Falsehoods OpenAI, 817 questions spanning 38 categories. Targets imitative falsehoods, on which larger models actually do worse, though fine-tuning can fix these errors. 92.85% of answers match a Wikipedia title; nearly 40% of questions require reasoning across multiple passages, and 17% need some common sense.
- sylinrl/TruthfulQA
- HHEM
- [2017.05] TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension 95k QA pairs gathered from 14 trivia websites, with on average 6 evidence documents per question.
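FActScore above scores a long-form generation as the fraction of its atomic facts that the knowledge source supports. A minimal sketch of that scoring step, with a toy set-membership check standing in for the paper's retrieval-backed LM verifier:

```python
def factscore(atomic_facts, is_supported):
    """FActScore-style precision: fraction of atomic facts that the
    knowledge source supports (0.0 for an empty fact list)."""
    if not atomic_facts:
        return 0.0
    return sum(map(is_supported, atomic_facts)) / len(atomic_facts)

# Toy knowledge base standing in for Wikipedia + an LM verifier.
kb = {
    "Paris is the capital of France",
    "The Seine flows through Paris",
}
facts = [
    "Paris is the capital of France",
    "The Seine flows through Paris",
    "Paris has a population of 80 million",  # unsupported
]
print(round(factscore(facts, lambda f: f in kb), 3))  # 0.667
```

The paper's other component, decomposing the generation into atomic facts, is itself an LM call and is not sketched here.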
Role-Playing#
- [2025.02] CoSER: Coordinating LLM-Based Persona Simulation of Established Roles 18k characters, 30k conversations
- [2024.08] MMRole: A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents Multimodal role-playing evaluation: 85 characters, 11K images, and 14K single- or multi-turn dialogues
- [2024.01] Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment 4k characters, 36k conversations
- [2024.01] CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation Chinese role-playing evaluation: 77 characters, 1,785 multi-turn dialogues
- [2023.12] RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models Chinese-English bilingual role evaluation: 300 characters, 6,000 questions
- [2023.10] RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models
- SuperCLUE-Role: redefining the Chinese role-playing LLM evaluation benchmark
Multimodal#
- [2024.10] MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks 505 tasks, 8,186 samples.
E-commerce#
- [2025.05] TransBench: Benchmarking Machine Translation for Industrial-Scale Applications The benchmark's datasets include an "e-commerce culture" category
- [2025.02] ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models
- [2024.10] Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models
Reasoning#
- [2025.09] MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables Moral reasoning. Current LLMs mostly recall memorized answers rather than genuinely understand; larger models are stronger but prone to shallow, out-of-context readings; models often contradict themselves; reasoning-enhanced models are no better (likely because RL was never trained on such tasks); and models shy away from answering that no option is correct.
- [2025.05] FABLE: A Novel Data-Flow Analysis Benchmark on Procedural Text for Large Language Model Evaluation Evaluates reasoning over data flow in procedural text: cooking recipes, travel routes, and automation plans.
Safety#
- detecting-and-reducing-scheming-in-ai-models Detecting scheming (covert, deceptive behavior) in AI models
- Stress Testing Deliberative Alignment for Anti-Scheming Training
RAG#
- [2024.09] Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation Without retrieval the model answers only 0.408 correctly; with a few retrieved articles, just 0.474; with all the needed articles (oracle retrieval), 0.729, and even then it still stumbles on numeric and tabular reasoning. A "multi-step retrieval" approach, where the model iteratively generates search queries, fetches articles, and fills in missing information, combined with few-shot prompts that guide step-by-step thinking, lifts accuracy to 0.66, an improvement of over 50% and close to the oracle setting.
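The multi-step retrieval loop described in the entry above can be sketched as follows; `generate_query`, `retrieve`, and `answer` are hypothetical stand-ins for the LLM and search calls, not an API from the paper:

```python
def multistep_answer(question, generate_query, retrieve, answer, steps=3):
    """Iteratively generate a query, retrieve, and accumulate evidence,
    then answer from everything gathered.

    generate_query(question, docs) -> next search query (an LLM call)
    retrieve(query)                -> list of documents (a search call)
    answer(question, docs)         -> final answer (an LLM call)
    """
    docs = []
    for _ in range(steps):
        docs.extend(retrieve(generate_query(question, docs)))
    return answer(question, docs)

# Toy two-hop example: find the birth year, then a fact about that year.
corpus = {
    "alan turing": ["Alan Turing was born in 1912."],
    "1912": ["1912 was a leap year."],
}
gen = lambda q, docs: "1912" if docs else "alan turing"
ret = lambda query: corpus.get(query, [])
ans = lambda q, docs: docs[-1]
print(multistep_answer("Was Turing born in a leap year?", gen, ret, ans, steps=2))
# 1912 was a leap year.
```

The key design point is that each new query conditions on the documents gathered so far, which is what lets the model bridge multi-hop questions that a single retrieval round misses.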